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Abstract 

In their 2008 and 2009 papers, Sumner and colleagues introduced the "squangles" - a 
small set of Markov invariants for phylogenetic quartets. The squangles are consistent 
with the general Markov model (GM) and can be used to infer quartets without the need 
to explicitly estimate all parameters. As GM is inhomogeneous and hence non-stationary, 
the squangles are expected to perform well compared to standard approaches when there 
are changes in base-composition amongst species. However, GM includes the IID assump- 
tion, so the squangles should be confounded by data generated with invariant sites or with 
rate- variation across sites. Here we implement the squangles in a least-squares setting that 
returns quartets weighted by either confidence or internal edge lengths; and use these as 
input into a variety of quartet-based supertree methods. For the first time, we quanti- 
tatively investigate the robustness of the squangles to the breaking of HD assumptions 
on both simulated and real data sets; and we suggest a modification that improves the 
performance of the squangles in the presence of invariant sites. Our conclusion is that 
the squangles provide a novel tool for phylogenetic estimation that is complementary to 
methods that explicitly account for rate-variation across sites, but rely on homogeneous 
- and hence stationary - models. 
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Introduction 



There is a consensus opinion that the most robust and easily interpretable phylogenetic methods 
are based on stochastic models of how sequences evolve. Many simulation studies comparing 
phylogenetic methods have supported the use of model-based likelihood methods over parsi- 



mony (Hillis 1995 Huelsenbeck 1995 Guindon and Gascuel 2003 Gaucher and Miyamoto 



2005). However, the question of whether current models of sequence evolution are adequate is 



still an open one. 

The continuous-time Markov models of nucleotide evolution in common use - namely the 
general-time-reversible model (GTR) and its submodels - are stationary, reversible and homo- 



geneous (Jermiin et al. , 2008). Stationarity implies that the expected base-composition should 



be constant across the tree. However, in real data significant differences in base-composition 



have been observed (Lockhart et al. , 1992 Foster and Hickey 1999; Phillips and Penny, 2003 



Gruber et al. 2007), and it has been shown that species with similar base-composition tend 



to group together regardless of their evolutionary history (Jermiin et al. 2004; Phillips et al. 



2004 Davalos and Perkins, 2008) - although the base-composition bias may have to be quite 
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pronounced before the accuracy of maximum-likelihood {ML) significantly diminishes (Galtier 



and Gouy, 1995). If the true evolutionary process was heterogeneous across the tree, Sumner 
et al. ( 2012a[b ) have shown that GTR has an undesirable property that makes a consistent 



interpretation difficult. 

To accurately capture biological processes, more parameter-rich models than the GTR fam- 
ily may sometimes be required. For example, Galtier and Gouy (1995 1998) explore models 
that incorporate differences in base-composition. Other researchers have discussed using mix- 



tures of rate matrices from the GTR class, either as mixtures across the tree (Pagel and Meade 



2004|, or as a temporal hidden Markov model where different rate matrices may apply in differ 



ent parts of tree (Whelan, 2008). Barry and Hartigan ( 1987a|[b ) suggested the general Markov 
model (GM) for use in phylogenetics. This model allows distinct Markov matrices to be as- 
signed to each edge in the tree, with only the restriction that the entries of the matrix be 
interpretable as substitution probabilities. 

Implementing GM for phylogenetic inference is problematic because it is very parameter 
rich. Even for just three taxa, there are 39 free parameters; and each additional taxon presents 
another 24 parameters to be estimated. For a homogeneous implementation of GTR the corre- 
sponding count is nine free parameters with each additional taxon requiring the addition of only 
two more. Models that are very parameter rich lose statistical power: "... extensive addition of 
parameters comes at a price - the predictive power of the theory (the information that the data 
can reveal about the underlying tree) tends to be drowned out in a sea of parameter estima- 
tion" (Steel, 2005). This is the well-known statistical issue of bias- variance tradeoff; where, for 
parameter-rich models, parameter estimates may be relatively unbiased but will have relatively 
high variance (Burnham and Anderson 2002). 

The worst case for the bias-variance tradeoff occurs when a phylogenetic model is not 
"identifiable" . Identifiability ensures that the parameter values that are input to the model can 
be computed exactly from the resulting probability distribution of site patterns. If a model is 
not identifiable, then there are parameters in the model whose estimation variance is formally 
infinite. Chang (1996) proved that GM is identifiable and AUman and Rhodes (2008) extended 
the result to include invariant sites (although both results are contingent on potential "label 
swapping" of character states at the internal vertices of the phylogenetic tree). 

There has been interest in implementing both GM (Jayaswal et al. , [2005 ) and GM-I-I 



(Jayaswal et al. , 2007) in an ML context. By conducting a simulation study on triples, Os- 



camou et al. (2008) compared five different methods for phylogenetic estimation under GM; 



their results suggested that the methods presented in Goldman et al. (1996) and Gojobori et al. 



(1982) - as implemented in Knight et al. (2007) - provide a reasonable compromise between 
computational efficiency and statistical accuracy. In contrast, Zou et al. (2011) suggest that the 
"label swapping" proviso causes heuristic search methods to frequently get stuck in parameter 
regions that correspond to sub-optimal likelihood values. 

Taking these considerations into account, it is of interest to develop phylogenetic methods 
that are both consistent with more general models of sequence evolution and avoid the dangers 



of over-parametrisation. The first advance in this regard was the logDet distance (Lockhart 



et al. 


1994 


Lake 


1994; 



can be computed as d^y = — log(det(F^y)) where F^y is the 4x4 divergence matrix for the 
sequences x and y. For example, the matrix element {Fxy)^(j counts the number of sites where 
an A was observed in sequence x and a C was observed for sequence y. This distance can be 
used to consistently estimate tree topology, although as noted in Lockhart et al. (1994) if a 
consistent estimate of edge lengths is desired then the formula requires an extra term involving 
the base-composition at each of the taxa. These pairwise distances can then be used to build 
large trees by using distance-based methods such as Neighbour Joining (iVJ). Bypassing the 
need to estimate a complete set of model parameters, logDet+NJ consistently estimates tree 
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topology under GM (Steel, 1994) - thus providing a sensible compromise to the bias- variance 
tradeoff. 

The logDet function is constructed from the simplest example of a "Markov invariant" 



(Sumner et al. , 2008). From this point of view, the defining feature for the logDet function is 
the following property of det{Pxy) - where the divergence matrix F^y has been replaced with the 
corresponding theoretical probability distribution P^y arising under GM. Consider an edge e 
that lies on the path between taxa x and y with associated Markov matrix Mg. If we extend this 
edge by inserting an additional Markov matrix M'^ so that Me — ^ M'^Me, then the consequent 
change in det^P^y) is given by det^P^y) — )• det(Mg) det(P3;j^). Recall that in a continuous- 
time formulation, = e'^<=*= for some rate matrix Q'^ and time tg, so — log(det(Mg)) = 
(sum of rates in Q'^) t'; whence the interpretation of the logDet as a distance measure consistent 
with GM. 

The theory of Markov invariants applies exactly this idea to larger subsets of taxa: polyno- 
mial functions of phylogenetic divergence arrays that are multiplied by a power of the deter- 
minant of the Markov matrix that extends a pendant edge. The restriction to pendant edges 



is not relevant for the logDet but is pivotal in the general case. Sumner et al. (2008) formalise 
this definition and describe methods to calculate the number of Markov invariants that exist 
for a given number of taxa and character states. In particular, for quartet trees under GM on 
four character states - i.e. DNA, they show there are four Markov invariants; and they refer 
to these as the "squangles" . (The term squangle derives from stochastic quartet tangle; more 



explanation is given in Sumner (2006). 



For a quartet with pendant edges 



'XI ^yi 



and the defining feature of the squangles is 



their behaviour when these edges are extended by inserting the additional Markov matrices 
M[^, M[^, M'^^ and M'^^. If q{Pxyzw) is the value of a squangle before the extension, then 

q{P.yzu,) ^ det(M;j det(M;j det(M;j det(M;jg(P,y,^). 

It is precisely this behaviour that makes the squangles analogous to the "Det" part of the 
logDet function. 

It is important to note that Markov invariants are in general different from phylogenetic 



invariants ( 


Cavender and Felsenstein, 


1987 


Evans and Speed 


1993 


1987 


Steel et al. 




1993 


). For a given model of sequence evolution 



Felsenstein 1991; Lake 



phylogenetic invariants 

are polynomial functions that evaluate to zero for a subset of phylogenetic trees regardless of 



particular model parameters. Sumner and Jarvis (2009) use the symmetries of leaf permutation 



on quartets to show that, for each possible quartet topology, the squangles can be arranged 
into a prescribed basis where two of them form phylogenetic invariants. This amalgamation of 
the properties of Markov and phylogenetic invariants is a very special circumstance and is not 
achievable in general. 



Following the investigation of the effect of base-composition bias given in (Jermiin et al. 



2004), Sumner et al. (2008) conducted simulations showing that the squangles can be used to 
infer quartets accurately under both conditions of extreme base-composition bias and relatively 
short sequence lengths (performing at least as well as logDet+NJ). This is consistent with the 



results presented in Casanellas and Fernandez-Sanchez (2007), and is in contrast to previous 



analyses of phylogenetic invariants which have generally shown low-power compared to other 



phylogenetic methods (Huelsenbeck and Hillis 1993) 

As with the logDet, the squangles are not designed to handle situations where the data 
arises under a process that breaks the IID assumptions, i.e. invariant sites or rate-variation 
across sites. For the first time, we explore to what extent the squangles are robust to such 
violations. This kind of "tool abuse" is common in phylogenetic practice. As alluded to above, 
maximum-likelihood methods are often used in situations where the models are clearly not great 
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matches to reality; and simulations have been used to suggest that accuracy of phylogenetic 
inference under ML is reasonably robust to many kinds of model violation. 

In this paper we aim to determine if the squangles can become a practical tool for phyloge- 
netic estimation. In particular we address the following questions: 

1. How can we measure our confidence in the quartet tree returned by the squangles? 

2. How can we best use the squangles in combination with existing software to perform 
phylogenetic estimation for many taxa? 

3. In what settings would the squangles outperform NJ with distances computed assuming 
a stationary model, NJ using logDet distances, or ML assuming homogeneous models? 

4. How robust are the squangles to the presence of invariant sites? 

5. Can the squangle method be modified to make it robust to invariant sites? 



Methods 
The squangles 

A DNA sequence alignment for four species can contain up to 4^ = 256 site patterns (we assume 
no gaps, missing data or ambiguous characters). The relative frequencies (or proportions) of site 
patterns can be summarised in a vector f of length 256. The squangles are then homogeneous 



polynomials of degree 5 in these pattern frequencies. Using the basis prescribed in Sumner and 



Jarvis (2009), there are three squangles that are useful for phylogenetic quartet estimation: 
here denoted as gi(f), 5'2(f), and q'3(f). Each of these has 66,744 terms, and together they 
satisfy the linear relation qi + q2 + = (which is to say that up to linear dependence there 
are only two of them). 



Sumner and Jarvis (2009) give algebraic expressions for the squangles, and derive their 
expectation values when the pattern frequency vector f arises from GM on a given quartet 
topology. These expectation values are given in Table [Tj For instance, if the pattern frequency 
f has been generated on the quartet 12|34 - meaning that the middle edge splits taxa 1 and 2 
from taxa 3 and 4, qi has expectation value E[qi\ = 0, q2 has non-positive expectation value 
E[q2] = —u < 0, and q^ has expectation value Elq^] = u (equal in magnitude but opposite 
in sign to the value for q2). The numerical value of u will depend on the particular Markov 
matrices that underlie the process that generates the data. We will discuss this in detail below; 
but note here that the symmetries inherit in Table [T] imply that u = v = lu = Q only in the 
case of a star-tree. 



Hypothesis 


Quartet 


E[q,] E[q2] E[qs] 


-Hi 


12|34 


—u u 


■H2 


13 24 


V —V 


^3 


14 23 


—w w 



Table 1: Expected values of each of the squangles on each of the three possible quartet topolo- 
gies, with u, V, w > 0. 

The expectation values of the squangle qi and the linear combination q2+q3 are zero on 12|34. 
Hence these are phylogenetic invariants for that quartet (with complementary phylogenetic 
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invariants occurring for the other two quartets). As can be confirmed from Table [T| the relation 
Q'l + ?2 + ?3 = implies that E[qi + q2 + qs] = E[qi] +E[q2] +E[q3] = for each quartet topology. 
Recall that Steel] (1994) used the properties of the logDet distance to prove that (unrooted) 



tree topology is identifiable under GM. The existence of the squangles and their properties as 
presented in Table [T] provides an independent verification of this result. 

Least-squares implementation of the squangles 

Given an arbitrary site pattern frequency vector f, the squangles qi{i), '?2(f), and qs^i) will 
evaluate to values that will not exactly match any of the three scenarios given in Table 1 (they 
will do so only for infinite sequence lengths and when the process that generated f was precisely 



GM on a fixed quartet). Following Sumner et al. (2008), we consider each quartet as a distinct 



hypothesis (see Table [T]) and use a least-squares framework to decide which hypothesis best 
explains the observed values of the squangles. We must be careful to take into consideration the 
linear relation qi + q2 + q3 = 0, which implies that, for each quartet, we should consider only two 
of the squangles as independent. To achieve this in a way that is consistent when considered 
across the three quartets, for each case we ignore the squangle with vanishing expectation 
value. For instance, for the quartet 12|34, we ignore qi and define the residual sum of squares 
as RSSi = (gs + uf + (ga - u^. 

We want to find the least-squares estimate for u subject to the constraint u> 0. To do so, 
we substitute = u, differentiate the residual with respect to x and set the result equal to 
zero: 

= 4:x{q2 + X ) — 4a;(g3 — x ) = 0. 



dx 

Solving gives two solutions u = x^ = and u = x^ = ^{q^ — ^2)- By checking the sign of the 
second derivative we determine that when q2 > qs the minimum occurs at m = 0, and when 
'?2 < ^3 the minimum occurs at m = ^(gs — 52)- If i?2 = gs then both solutions are the identical 
and the minimum is at m = 0. Note that these conditions ensure that m > (as they must 



since u = x^). 

Applying the complementary procedure to define residuals for the other two quartets - in 
each case ignoring the squangle with vanishing expectation value - we obtain the following least- 
squares estimates for the parameters u, v and w under the hypotheses of the three respective 
quartets: 

u = max{0, |(g3 - ^2)}, 
V = max{0, |(gi - gs)}, 
w = max{0, |(g2 - gi)}- 

Given the least-squares estimates of u, v, or w for each hypothesis. Table |2] displays the 
corresponding RSS values for each of the 6 possible configurations of the squangles as ordered 
on the real number line. We deem the hypothesis with the strongest support to be the quartet, 
or quartets in the case of ties, with the minimum RSS. 

If, as presented in Table [T| the observed values of gi, g2, and ga match the order defined 
by the expectation values for a particular quartet, then Table |2] shows that the RSS for that 
quartet will always be lower than the RSS for the other two topologies. For example, the 
configuration g2 < gi < gs in row 1 of Table [2] matches the order given in row 1 of Table [Tj 

If the configuration of gi, g2, and q^ doesn't match one of the orders displayed in Table [1} 
i.e. rows 4-6 of Table [2| then one hypothesis is excluded - the one where the ordering is 
"completely wrong" - and, out of the remaining two hypotheses, the minimum RSS occurs for 
the hypothesis T-Li whose corresponding squangle g, is closest to zero. 
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Hi 


= 


12|34 


^2 


= 


13|24 


^3 


= 


14|23 


Ordering 




u 




RSSi 






RSS 


W 




RSS 


q2<qi< 


qs 




q2) 


2^1 







ql + ql 







+ ^2 


gs < ^2 < 


qi 







ql + ql 




q3] 


hi 







+ ^2 


qi<q3< 


q2 







ql + ql 







q\ + ql 






¥3 


q3<qi< 


q2 







ql + ql 




q3] 


H 






hi 


qi<q2< 


qs 




q2) 


2^1 







ql + ql 






hi 


q2<q3< 


qi 




q2) 


2^1 




q3] 


hi 







ql + ql 



Table 2: Least-squares estimates and corresponding residual sums of squares for each of the 
three possible quartet hypotheses. Note that the first three rows give orderings on the squangles 
that match the ordering of expectation values presented in Table [Tj 



In the case of a single equality between gi, q2, and gs, the configuration matches two of the 
orders displayed in Table [2] one matches an order displayed in Table [T] and one does not. It is 
easy to check that the same minimum RSS occurs for both possibilities. 

The only circumstance where the squangles are equal occurs when qi = q2 = q3 = 0- In this 
case the RSS values of the hypotheses are also equal. 

Figure [T] summarises the information in Table [T] as it is interpreted on the q2-q3 plane, 
showing the regions where the different configurations apply and the different quartets that are 
selected. 



Accounting for invariant sites 

As GM includes the IID assumption, the squangles are not expected to perform well in cases 
when the data are generated under a process with significant numbers of invariant sites or when 
there is rate-variation across sites. To account for the possibility of invariant sites, we suggest 
a simple modification to the calculation of the squangles. Firstly, the proportion u of invariant 



sites is estimated using the method given in Steel et al. (2000). Then the observed proportions 
of constant site patterns - i.e. the patterns AAAA, CCCC, GGGG and TTTT - are rescaled 
by multiplying hj l — u, and the pattern vector f is renormalised to sum 1. The altered pattern 
vector f is then used as input to the squangle method with no further modification. 



Quartet weights 

Several of the methods proposed for constructing trees based on sets of quartets rely on the 



quartets having weights, for example, QuarietSuite (Willson, 1999), Quartet MaxCut (Snir and 



Rao, 2012) and QNet (Griinewald et al. , 2008). Although they currently do not require them. 



presumably supertree methods such as Matrix Representation with Parsimony [MRP) (Baum 



1992 


Ragan 


Steel 


2006 


) 



of such weights. In most cases the weights are required to be proportional to the confidence 



in a particular quartet topology. An exception is the program QNet (Griinewald et al. , 2008), 
where the quartet weights are required to be proportional to the length of the internal edge of 
the quartet. 

For the squangle method implemented using least-squares, we take the RSS values to be 
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Figure 1: Regions in the q2-q3 plane. Different regions of the plane correspond to different 
configurations qi,q2, and ^3 (indicated by grey arcs), and to different tree topologies (indicated 
by shading). Bold black lines correspond to squangle values that exactly match one of the three 
tree topologies. 



inversely proportional to our confidence in the corresponding quartets: 

Rssr^ 



RSSi-^ + RSS2~^ + RSS3 



1 ' 



with Wi + W2 + Ws = 1- This definition can be justified by supposing that, for fixed sequence 
length and the quartet 12|34, q2 and q^ are independent and normally distributed around their 
respective means {—u and u) with identical standard deviation a. Thus the joint probability 
density function is given by 

^ V ( {Q2 + u)^ + {qs - u) 



p{q2,q3;u,a) = — =^ exp 



27TaJ V 2a2 



Under this assumption, it is clear that the maximum-likelihood estimates are given exactly as 
hil3 — 92) and (T^ = ^"^"^^ = ^qf = l{q2 + q?,Y- As is well known, for these assumptions u is 



exactly the same as the best estimate arising from the least-squares procedure. Given observed 
values q2 and qz, the value of the probability density function at the maximum- likelihood 
estimates is proportional to the residual sum of squares: 

. \ 1 ^ ( {Q2 + \{qz-q2)f + {q3-\{q3-q2)f \ 1 
p{q2,q.;u,a) = | ^ j exp |^ ^ J = -RSS, . 
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From this we conclude that our definition of the weights Wi is consistent with the assumption 
that the squangles are independent and normally distributed around their respective means 
with identical variance. We present a simulation study below that provides evidence that this 
assumption is reasonable. While this gives a plausible theoretical justification of the least- 
squares approach and the interpretation of the weights Wi, we emphasise that our primary aim 
in this paper is to use simulation and examples from real data to show that the squangles 
provide a robust phylogenetic tool. 

It is apparent from the symmetries in Table [T] that u (or v or w) must be correlated to 



the internal edge length of the quartet. However, as presented in Sumner and Jarvis (2009), 



the precise relationship under the full assumptions of GM is too complicated to be useful. If 
we simplify the assumptions of GM on a quartet so on the internal edge we restrict to the 
Jukes-Cantor (JC) model (Jukes and Cantor, 1969), then it is possible to derive an explicit 



expression for the relationship between the internal edge length and the expectation values u, 
V and w. 

To achieve this, fix the quartet 12|34 and suppose that the JC substitution matrix on the 
internal edge has off-diagonal entries equal to a/3. As is shown in the appendix, the expectation 
value of the squangle gs can then be written as a quintic function of a: 



u 



273' 



-a(3^ - 225a + 276a^ - 154a^ + 32a^) JJ det(Mi 



(1) 



where Mi to M4 are GM substitution matrices on the pendant edges. 

Using the approach that forms the foundation of the logDet distance, the determinants of the 
substitution matrices that appear in ([T]) can be estimated, as follows. Let F^y be the 4x4 diver- 
gence matrix for taxa x and ?/, and let P^y be the corresponding theoretical probability distribu- 
tion under the assumption of GM on pendant edges and JC on the internal edge. Assuming the 
quartet 12|34, we have det(Pi2) = det(Mi) det(M2)(^)^ and det(P34) = det(M3) det(M4)(|)^ 
Rearranging and substituting this into equation ([T]) gives 



7(a) - E 



Q3 



det(Pi2)det(P34) 



where 7(a) := |^a(3^ — 225a -|- 276a^ — 154a^ + 32a^). Considering the analogous scenario on 
the other quartets 13|24 and 14|23, we have 



7(a) - E 



det(Pi3)det(P, 



24 



0, 



7(a) - E 



(I2 



det(Pi4)det(P23), 



0. 



respectively. 

We can use the observed pattern frequencies f to estimate the probability distributions P12 
and P34, and either, assuming a is small, a quadratic approximation or a general root finding 
algorithm to find the smallest positive root of the above equation. In this way, we obtain an 
estimate a of the internal edge length of a quartet tree under the assumption of JC on the 
internal edge and GM on the pendant edges. 

We see that the squangle method is capable of returning two types of quartet weights. 
For each quartet, the first gives a confidence value, and the second gives a measure of genetic 
distance along the internal edge. 

Below we compare these weights to those computed from maximum-likelihood inference. 



where, as is described in Strimmer and Rambaut (2002), weights are calculated as 



Li -\- L2 -\- 



where Lj is the likelihood of quartet i. 
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A 
C 
G 
T 



A 

1 — a 



ab/{2b+l) 
1 — a 



C 



G 



ab/{2b + l) 
ab/{b + 2) 
1 — a 



T 



a/(26 + l) 
a/(6 + 2) 
a/(6 + 2) 
1 — a 



a/{b + 2) 
a/{b + 2) 
a/(26+ 1) 



a6/(6 + 2) 
ab/l2b+l) 



ab/{2b + l) 



Table 3: Structure of the substitution matrices used in the simulations, b > 1 parametrises GC 
bias. 



Simulation study 

We conducted two simulations to assess the assumption that, taken in pairs, the squangles 
are independent and normally distributed about their respective means with identical standard 
deviation. We simulated data on a star tree under the JC model, where M, the substitution 
matrix for each pendant edge, had 0.9 for diagonal entries (i.e. in terms of Table |3| a = 0.1 and 
b = 1). We next simulated data on the quartet 12|34 where Mg, the substitution matrix for 
each short edge, had a = 0.02 and b = 1, and Mi, the substitution matrix for each long edge, 
had a = 0.4 and 6 = 5. We recorded all the values of q2 and qs along with the tree topology 
selected and plotted these values on the q2-q3 plane (as shown in Figure [2]) along with their 
marginal distributions. 

We tested the strength of the relationship ([T| between the squangle values and internal edge 
lengths for finite data. This was done by conducting simulations with sequence length set at 
1000 sites under JC - so all edges had 6=1 (i.e. no GC bias). On the pendant edges we set 
a = 0.1 and on the internal edge a was taken in the range 0.01 to 0.20. For each value of a, we 
performed 1000 repetitions. 

Our primary simulations tested the robustness of the least-squares implementation of the 
squangle method, and were conducted on (i) the "Standard" tree, which has has one short 
internal edge and 4 long pendant edges of equal length, (ii) the "Felsenstein" tree, and (iii) the 
"Farris" tree. The Felsenstein and Farris trees each have a combination of 3 short and 2 long 
(external) edges: on the Felsenstein tree, the long edges are two non-sister external edges, and 
on the Farris tree they are two sister external edges. Parameters varied were: 

1. Sequence length c G {500, 1000}; 

2. GC bias b E {1,3,5,7,9}; 

3. Short edge: a E {0.01,0.02,0.04}; 

4. Long edge: a E {0.2,0.4}; 

5. Proportion of invariant sites u E {0, 0.25, 0.5, 0.75} 

Each parameter setting was run 1000 times and trees were estimated using: the least-squares 
implementation of the squangles SQ; least-squares squangles modified to down- weight the ob- 
served frequency of constant sites SQi; neighbor-joining NJ, where distances were estimated 
under the JC model; NJ using logDet distances NJld; maximum-likelihood under the JC model 
ML; maximum- likelihood under the JC model plus invariant sites MLi; and maximum parsi- 
mony MP. 

Supertrees or supernetworks 

Most phylogenetic data sets of interest contain more than four taxa, so to perform inference 
using the squangles it is necessary to break the data into groups of four taxa, infer a quartet 
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for each group of four taxa, and then piece these quartet trees back together into an overall 
tree (or perhaps network). We have produced Python code that reads a sequence alignment in 



Phylip ( Felsenstein , 1989) format and uses the squangles to calculate a set of weighted quartets. 



The squangles cannot deal with data that includes gaps or ambiguous characters, so there is 
an option to either remove such sites at the beginning of the analysis from the whole alignment 
(faster), or on a quartet by quartet basis (slower but retains more information). There is also 



an option to account for invariant sites by implementing SQi; this uses the method of Steel 



et al. (2000) and hence must be done on a quartet by quartet basis. Options for output format 



include: 



A simple list of the best tree for each quartet (this could be used as input to MRP, for 



instance using Clann (Creevey and Mclnerney, 2005), or could be opened in SplitsTree4 



(Huson and Bryant, 2006) to create a Z'-closure network); 



• A list of each quartet with confidence higher than some threshold value (this could be 
used as above); 

• A file in QuartetSuite format including three confidence weights for each set of 4 taxa; 

• A file in QNet format that includes three distance weights for each set of 4 taxa. 
Python code is available from BRH at harhara.holland@utas.edu.au. 

Case studies 

We applied the squangle method to two biological data sets that have been discussed in the 
context of compositional heterogeneity. 



Mammal data 



Phillips and Penny (2003) used a data set consisting of mtDNA for mammals (and out groups) 
to test evidence for Theria (a hypothesis which groups Marsupials and Placentals) against 
Marsupionta (a hypothesis that groups Marsupials and Monotremes). The data set contains 
two monotremes (Platypus and Echidna), nine placental mammals, four marsupials (Opossum, 



Bandicoot, Brushtail, Wallaroo) and ten outgroup taxa (birds and reptiles). Phillips and Penny 



(2003) found RY coding of 3rd codon positions decreased support for Marsupionta in compar- 



ison to Theria. Since Theria is more frequently supported by analyses of nuclear genes, they 
argued that the support for Marsupionta could be an artefact of base-composition bias. We 
used the squangle method to evaluate support for 3 competing hypotheses: 



Marsupionta 

Theria 

Other 



(Outgroup, ((Monotreme, Marsupial), Placental)); 
(Outgroup, ((Placental, Marsupial), Monotreme)); 
(Outgroup, ((Placental, Monotreme), Marsupial)). 

We investigated if squangles appeared more robust to the apparent base-composition bias 
than a standard likelihood approach. We compared ML trees to squangle-based trees for the 
2 • 10 • 4 • 9 = 720 quartets that include one taxon from each of the four groups. Following 



Phillips and Penny (2003), ML was performed using the TN93 model (Tamura and Nei 1993) 



with and without allowing for invariant sites. We used the squangles with and without the 
invariant sites corrections, assigned confidence-style weights to each quartet and analysed each 
codon position separately. This allowed us to test if high- confidence quartets were more likely 
to support any particular hypothesis. 



10 



Additionally, we ran three analyses - one for each codon position - on all of the (^^^) = 12, 650 
subsets of four taxa. Quartets with confidence weight greater than 0.95 were retained and used 
as input to the MRP supertree method. 

Possum data 



Gruber et al. (2007) investigated the phylogenetic relationships for 41 species of possums. They 
found that the gene RAG gave very different results to all the other genes previously inves- 
tigated and to morphological data. With reference to their Figure 1, all the data sets they 
analysed gave a clade they label B (which contains 10 species from the genera Gracilinanus, 
Cryptonanus, and Thylamys) , and a clade they label I (which contains 8 species from the gen- 
era Micoureus and Marmosd). For all the data sets apart from RAG they found that clade 
B was sister to the Marmosops genus (5 species), with clade I grouping elsewhere in the tree. 



However, for the RAG gene clade B and clade I were sister to each other. Gruber et al. (2007) 



concluded that the grouping of clade B with clade I for the RAG gene was the result of a 



base-composition artefact. Using the same approach as described for the Phillips and Penny 



(2003) data, we evaluated support for 3 competing hypotheses for each of the 10-8-5 - 18 = 7200 



quartets that include one taxon from each of the 4 groups: 

True ((B, Marmosops), I, Remainder); 
GC-bias ((B, I), Marmosops, Remainder); 
Other ((B, Remainder), Marmosops, I). 

Likelihoods were calculated under the GTR model both with and without invariant sites. 
Again, we investigated if high-confidence quartets were more likely to support the presumed 



true tree. For the Gruber et al. (2007) data set we also investigated 1st and 2nd codon positions 
combined, and all three codon positions combined. 

For the RAG gene and the non-RAG genes we also ran three analyses each - one for each 
codon position - of all C^^^) = 101,270 quartets. Quartets with confidence weight greater than 
0.95 were retained and used as input to the MRP supertree method. 



Results 
Simulation study 

Figures |2](a) and (b) show results of 1000 simulations with sequence lengths of 1000 sites. The 
values of the squangles q2 and are plotted in the plane and the marginal distribution for each 
squangle is shown as a histogram. Figure [2]^a) presents data simulated on a star tree under the 
JC model (ie. 6 = 1), where M, the substitution matrix for each pendant edge had a = 0.1. 
The correlation between q2 and q^ was r = —0.519. Figure is for data simulated on the 
quartet 12|34 for the a Felsenstein tree. Mg, the substitution matrix for each short edge, had 
a = 0.02 and b = 1, and Mi, the substitution matrix for each long edge, had a = 0.4 and 6 = 5. 
The correlation between q2 and gs was r = —0.053. 

We conclude that the distribution of both q2 and q^ seems to be approximately normal, 
but with some skew (particularly for the Felsenstein-tree simulation) and some violation of 
independence (particularly for the star-tree simulation). 

Figure |3] shows how gs varies with internal edge length compared to the theoretical expec- 
tation under the JC model. At sequence length c = 1000 there is good agreement between the 
mean of the 1000 simulations for each value of a and the theoretical expectation. 

Figure |4] shows the results for short edge a = 0.02, long edge a = 0.4, and sequence length 
c = 1000. The SQ method suffers markedly from long branch attraction on the Felsenstein tree 
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Table 4: Analysis of 720 quartets from the Phillips and Penny (2003) data set by codon posi- 



tion. Numbers of each of the three possible quartet topologies are shown with high-confidence 
quartets (> 0.95 confidence weight for SQ and SQi, and > 0.95 likelihood weight for ML and 
MLi) shown in brackets. 



position 


1st 






2nd 






3rd 






method 


Marsupionta 


Other 


Theria 


Marsupionta 


Other 


Theria 


Marsupionta 


Other 


Theria 


ML 


606 (553) 


47 (26) 


67 (42) 


477 (395) 


210 (159) 


33 (15) 


322 (174) 


184 (84) 


214 (103) 


MLi 


530 (301) 


85 (16) 


105 (30) 


273 (87) 


372 (85) 


75 (6) 


310 (159) 


195 (88) 


215 (104) 


SQ 


604 (190) 


57 (10) 


59 (7) 


549 (172) 


157 (10) 


14 (2) 


268 (54) 


223 (33) 


229 (32) 


SQi 


445 (120) 


166 (27) 


109 (16) 


321 (74) 


351 (99) 


48 (9) 


235 (50) 


234 (39) 


251 (43) 



Table 5: Analysis of 7200 quartets from the RAG gene of the Gruber et al. (2007) data set by 
codon position. Numbers of each of the three possible quartet topologies are shown with high- 
confidence quartets (> 0.95 confidence weight for SQ and SQi, and > 0.95 likelihood weight 
for ML and MLi) shown in brackets. 



position 


1st 






2nd 






3rd 






method 


GC-bias 


Other 


True 


GC-bias 


Other 


True 


GC-bias 


Other 


True 


ML 


3538 (2203) 


667 (228) 


2995 (1070) 


4819 (2418) 


2279 (814) 


102 (0) 


7174 (7132) 


(0) 


26 (7) 


MLi 


3206 (493) 


965 (132) 


3029 (72) 


4819 (517) 


2244 (295) 


137 (0) 


7069 (6863) 


7 (0) 


124 (29) 


SQ 


3334 (464) 


680 (80) 


3186 (731) 


4549 (1152) 


2418 (222) 


233 (55) 


7181 (5274) 


(0) 


19 (0) 


SQi 


1661 (197) 


918 (168) 


4621 (1149) 


4186 (875) 


2701 (386) 


313 (80) 


5795 (3044) 


294 (56) 


1111 (264) 



whenever invariant sites are present. SQi performs much better than SQ when invariant sites 
are present, and there is still a slight decrease in accuracy as the proportion of invariant sites 
increases; however this is similar to what occurs for MLi. On the Farris tree all the methods 
seem biased towards getting the tree correct; although in this respect SQi seems to be relatively 
less positively biased. On the Standard and Felsenstein trees, there are regions where both SQ 
and SQi are more accurate than ML or MLi. This occurs when the proportion of invariant 
sites is low and the base-composition bias is high, although, as has been previously noted in 
Galtier and Gouy (1995), ML is reasonably robust to violations of its model assumptions. SQ 



has similar accuracy to NJ with logDet distances. 



Case studies 
Mammal data 



For the Phillips and Penny (2003) data sets. Table |4| shows the performance of SQ and SQi 
compared to ML using a TN93 model both with and without invariants sites. Overall this data 
set produces far fewer high- confidence quartets than the [Gruber et al. (2007) data set. The SQi 
method with partitioning by codon position makes little headway in recovering the suggested 
true tree, with Marsupionta strongly preferred over Theria for 1st and 2nd codon positions. 
Third codon positions slightly favour Theria. MLi does little better than SQi at picking the 
Theria topology. Considering all quartets, SQi is slightly better for codon position 2, but worse 
for positions 1 and 3. Considering only high-confidence quartets, SQi is better for positions 1 
and 3 but slightly worse for position 2. 

The MRP tree constructed from high-confidence SQi quartets from each codon position 
(see Figure |5]), supports the Marsupionta hypothesis. Apart from this contentious point, it is 
congruent with previous studies. 



Possum data 



For the RAG gene of the Gruber et al. (2007) data set. Table |5j shows the performance of SQ 
and SQi compared to ML using a GTR model with and without allowing for invariant sites. 
None of the methods do particularly well: they more often choose the "GC-bias" tree than the 
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Table 6: Analysis of 7200 quartets from the Gruber et al. (2007) data set for a concatenation of 



1st and 2nd codon positions and for all codon positions. Numbers of each of the three possible 
quartet topologies are shown with high-confidence quartets (> 0.95 confidence weight for SQ 
and SQi, and > 0.95 likelihood weight for ML and MLi) shown in brackets. 



position 


1st 


and 2nd combined 


1st, 2nd 


and 3rd combined 


method 


GC-bias 


Other 


True 


GC-bias 


Other 


True 


ML 


4978 (3823) 


1112 (596) 


1110 (376) 


7197 (7196) 


(0) 


3 (0) 


MLi 


4923 (1292) 


1192 (276) 


1085 (22) 


7186 (7100) 


10 (3) 


4 (0) 


SQ 


4756 (1833) 


1140 (215) 


1304 (272) 


7198 (5653) 


1 (0) 


1 (0) 


SQi 


3378 (1099) 


1479 (258) 


2343 (493) 


6926 (4236) 


170 (7) 


104 (3) 



Table 7: Analysis of 7200 quartets from the non-RAG genes of the Gruber et al. (2007) data 
set by codon position. Numbers of each of the three possible quartet topologies are shown 
with high-confidence quartets (> 0.95 confidence weight for SQ and SQi, and > 0.95 likelihood 
weight for ML and MLi) shown in brackets. 



position 




1st 






2nd 






3rd 




method 


GC-bias 


Other 


True 


GC-bias 


Other 


True 


GC-bias 


Other 


True 


ML 


226 (20) 


179 (4) 


6795 (5026) 


1455 (342) 


1172 (79) 


4573 (2027) 


88 (0) 


618 (169) 


6494 (4894) 


Mli 


254 (0) 


341 (0) 


6605 (1418) 


1589 (57) 


1337 (8) 


4274 (1254) 


77 (0) 


657 (25) 


6466 (1611) 


SQ 


236 (0) 


67 (3) 


6897 (1721) 


557 (62) 


642 (42) 


6001 (1871) 


129 (2) 


463 (37) 


6608 (3262) 


SQi 


291 (8) 


81 (1) 


6828 (1637) 


446 (40) 


609 (32) 


6145 (1880) 


161 (15) 


528 (57) 


6511 (3330) 



true tree for all partitions, with the exception of SQi for codon position 1. When comparing all 
7200 quartets, SQi has more instances of the "true" tree and fewer instances of the "GC-bias" 
tree than MLi for all three codon positions. Restricting the comparison to high-confidence 
quartets, SQi picks more instances of the "true" tree than MLi for all codon positions, but 
for the 2nd codon position it also picks more of the "GC-bias" tree. For both ML and SQ 
adding invariant sites to the model reduces the number of incorrect trees chosen; accounting 
for invariant sites also reduces the number of high- confidence quartets overall. Results for 
combinations of partitions of the RAG gene are poor compared to analyses of single partitions 
(see Table [6]). All methods suffer a severe decline in performance for data that combines all 
three codon positions. 



For the non-RAG genes of the Gruber et al. (2007) data set. Table 7 shows the performance 



of SQ and SQi compared to ML using a GTR model with and without allowing for invariant 
sites. When comparing all 7200 quartets SQi, has more instances of the true tree and fewer 
instances of the bias tree both than ML and MLi; and this is true for all three codon positions. 
Intriguingly, for this example, adding invariant sites to the ML model increases the number 
of incorrect topologies chosen. Restricting the comparison to high- confidence quartets SQi, 
picks more true trees than MLi for all codon positions, but fewer true trees than ML without 
invariant sites. 

The MRP supertree formed from only high-confidence quartets from each codon position 
in both the RAG and non-RAG genes is shown in Figure |6] It recovers the presumed true 
relationship between the Marmosops genus and Glade B; the rest of the tree is also broadly 



congruent with the findings of Gruber et al. (2007). This is a very pleasing result as Gruber 



et al. (2007) found that analysis of this combined data under a variety of phylogenetic methods 
produced the artefactual grouping of clade B with clade I. However, if only high-confidence 
quartets from codon partitions of the RAG gene are considered, then the MRP tree does 
include the artefactual B+I grouping (results not shown). 

Efficiency 

While running the SQ and SQi methods on the case study data we timed some of the runs in 
order to check the efficiency of the python implementation of the method. For codon position 
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1 of the [Phillips and Penny (2003) data (3588 base pairs), the analysis of the 720 quartets for 
Table |4] took 90 seconds (approximately 0.125 seconds per quartet). For the combined data 
(10764 base pairs) the analysis of the 720 quartets took 98 seconds. 



Conclusions 



The squangles were originally presented in Sumner et al. (2008); Sumner and Jarvis (2009), 



here we provide the first investigation of how robust the squangle method is to violation of the 
IID assumption. We found that SQ is fairly robust to the presence of invariant sites for the 
standard (clock-like) simulation topology. However, with the Felsenstein topology performance 
degraded with only a small proportion of invariant sites. Encouragingly, a simple modification 
of the squangle method that estimates and removes invariant sites greatly improves overall 
accuracy. For data generated under a process that incorporates both a proportion of invariant 
sites and base-composition drift, the newly proposed SQi method is comparable to, or better 



than, maximum-likelihood under a homogeneous invariant sites model. Results from the Gruber 



et al. (2007) case study also indicate that SQi is more robust to changing base-composition 



than ML. 

We also introduced two new weighting schemes for quartets derived using the squangle 
method: one based on confidence in the quartet and other based on length of the internal edge. 
Restricting to high-confidence quartets improved accuracy in both the case studies. Combining 
high-confidence quartets from different genes and codon positions gave the presumed correct 



tree for the Gruber et al. (2007) case study; whereas other methods of analysing this data are 



typically misled by the strong base-composition bias in the RAG gene. It was disappointing 
that the approach of combining high-confidence quartets from different codon positions did 



not recover the "true" tree in either the Phillips and Penny (2003) case study, or the Gruber 



et al. (2007) case study when the data was restricted to just the RAG gene. Perhaps it was too 



hopeful to think that accounting for base-composition shift and invariant sites would be enough 
for unbiased phylogenetic estimation in these cases. We recall that while SQi can account for 
data that evolved under GM with invariant sites it is not expected to be consistent if there is 
rate- variation across sites. SQi will also not be consistent if there are mixtures of processes, e.g. 



as would be expected if different groups of sites were evolving under different models (Pagel and 



Meade, 2004; Kolaczkowski , 2004). The results for the case studies suggest that some of these 



extra unmodelled factors are probably at play. The case studies support one's intuition that 
partitioning by codon position should mitigate against the problem of mixtures of processes. 
To revisit the five questions posed at the end of the introduction: 

1. We have shown that viewing the squangles in a least-squares framework gives a natural 
way of assigning confidence-based weights to quartets. Furthermore, the case studies sug- 
gest that restricting to high-confidence quartets improves the ratio of correct to incorrect 
topologies; 

2. We illustrated the squangle method for tree-building by using them in conjunction with 
MRP, however there are many options for utilising quartet information to build trees or 
supertrees, and the output of the SQi method could be used as input to any of these; 

3. We have shown that there is a region of parameter space in which squangles, particularly 
with invariant sites accounted for, provide more accurate phylogenetic reconstruction than 
maximum-likelihood. As expected given the different assumptions of the SQi and MLi 
methods, SQi does comparatively better when there is base-composition bias, whereas 
MLi does better on data without base-composition bias; 
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4. We found that unmodified squangles were not particularly robust to the presence of 
invariant sites, particularly for the Felsenstein tree-shape; 



5. However, this could be remedied by applying the method of Steel et al. (2000) to estimate 
and subsequently remove a proportion of invariant sites. 

Overall our results show that SQi provides a complementary tool to existing methods in 
the phylogenetic toolkit. We suggest that the SQi method should be used alongside maximum- 
likelihood approaches in cases where base-composition differs across the taxa under considera- 
tion. 
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Appendix 

Here we derive the value of the squangle qi when evaluated on joint probability distribution 
P arising from the quartet 13|24 with GM substitution matrices Mj on the pendant edges and 
a JC substitution matrix with off diagonal entry | on the internal edge. The entries of P are 
represented as an array where Paggc is the probability of observing nucleotide states A, G, G 
and C at leaf 1 through to 4 respectively. We start by writing qi as a sum of two polynomials 



qi = — y (14,23)^ g^^^ recall the expressions given in Sumner and Jarvis (2009): 



y( \P) A ^ ^ -Psi2-PjiS-ffciA;2-Pi^i^2-^mi»n2fiiA;i£imi |Ci2fc2^2»n2 1 ) 

ji,kiA,mi,i2,k2/2,m2<^{A,C,G,T} 

f(U,23)(p\ _ X P2 P P P 1^ 1^ I 

J ) / J jj^i2 *^l'^2-' ^1^2-' »Ttim2 Fji/ci^imi |tj2fc2^2'Tl2 n 

ii,A:i/i,mi,i2,fe2/2,m2e{A,C,G,T} 

where A = ni=i clet(Mj); P(y4T) is the probability of observing the states A, T at the internal 
vertices of the quartet; PCET) = X]i6{yi cg t} \^ijke\ = if any of i,j,k and i are 
equal, and is 1 otherwise. 

Under the JC assumption for the internal edge, we have Pij = | (| + — y)); where 
Sij = 1 if 2 = j and otherwise, and Psi = -Pis = | for all i G {A,C,G,T}. We also notice 
that we can use the properties of \eijke\ to simplify the summations For instance, given any 
array X^jki, we have 

i,j,k,e(^{A,C,G,T} o-es 

where 5* is the set of permutations on the letters A, C, G and T. Applying the idea that underlies 
([3]) twice, we have 
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f''^^''^'^\P)=X ^ Pi:a2{A)Pai{A)j:Pai{C)a2{C)Pai{G)a2{G)Pai{T)a2{T) 

iTi,(T2GS 

^ ^ (i + ^<^i(CV2(C)(l-f)) (I +5ai(G)(T2(G)(l-f )) (t + 5cTi(T)(T2(T)(l-f )) , 

IT1,(T2S5 

and 

y(14,24)^p^ = A ^ -Pai(A)a2(A)-P(Ti(C)<T2(C)-Pai(G)a2(G)-Pai(T)a2(r) 

0'1,<T2 6S 

^ ^ XI (i +<^<^i{A)a2(A)(l-f ))^ (I +5<,,(C)<T2(G)(l-f )) (I +(^ai(G)a2(G)(l-f )) 

o'i,<T2e5 

• (I + (^ai(r)a2(T)(l-f )) ■ 

Prom here the problem is purely combinatorial. For instance, it is not difficult to show 
J2 1 = 4! • 4! = 24 • 24, 

(T1,<T2€S 

(5,,(^)<,,(^)=4!-3!=24-6, 

ai,a2€S 

<^<7i(A),72(A)5<7i(G)a2(G) = 4! • 2! = 24 • 2, 

(^(Ti(A)ct2(A)(^cti(G)(T2(G)(^(Ti(G)(T2(G) = 4! • l! = 24 • 1, 
(^(Ti(A)a2(A)(^ai(G)(T2(G)(^ai(G)a2(G)(^ai(T)(T2(r) = 4! • O! = 24, 

<Tl,cr2eS 

which wc use to expand the above expressions and carefully collect terms. This process gives 
the cubic 

y(i3,24) = ^ [24 • 24 • 1(1)3 + 24 • 6 • 3(1)^(1 - f ) + 24 • 2 • 3(|)(1 - f )2 + 24 • 1 • 1(1 - ff] , 
and, noting that Sfj — 5ij, the quintic 

j(i4,23)^p^ = A [24 • 24 • 1(1)^ + 24 • 6 • 5(1)^(1 - 4f ) + (24 • 6 • 1 + 24 • 2 • 9)(|)3(1 - ff+ 



45 

(24 • 2 • 3 + 24 • 1 • 7)(|)2(1 - f )3 + (24 • 1 • 3 + 24 • 2)(|)(1 - f )^+ 

24-1(1 - f )5] 



Taking the difference gives 



qi{P) = A^a(3^ - 225a + 276a^ - 154a=^ + 32a^), 



as required. 
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Figure 2: The distribution of q2 and for data simulated on (a) the star tree, and (b) the 
Felsenstein tree. In each case, the quartet returned by the least-squares routine is shown by 
different plotting symbols: o 12 | 34, A 13 | 24 and + 14 | 23. 
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Figure 3: gs against internal edge length a. The blue dashed line shows the theoretical value 
according to equation ([T]), the dots show each simulation, the red triangles show the mean value 
for the simulations at that value of a. 
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Figure 4: Main simulation results. Each rectangle corresponds to the accuracy of a particular 
method (one method per row), on a particular tree type (one tree type per column). Within the 
rectangle the x-axis relates to the proportion of invariant sites increasing from left to right; and 
the y-axis relates to the GC bias, increasing from bottom to top. Hence, the bottom left corner 
of each rectangle corresponds to the most straight-forward case for phylogenetic inference. The 
darkest red shade (found for MP on the Felsenstein tree) corresponds to 0% accuracy, and the 
lightest shade (found for MP on the Farris tree) corresponds to 100% accuracy. Simulations 
settings shown here are c = 1000, short edge a = 0.02, long edge a = 0.4. 



23 



— 



"Armadillo 
"FinWhale 
"Hippo 

"Flying Fox ^ 
"HShoeBat J[ 
"Tarsier ^ 
"pika 

"Elephant 

"Aardvark 

"Platypus 

"Echidna c 
O 

'Opossum 

"Bandicoot I ^ 

"Brushtail OJ 

"Wallaroo ^ 

"Rook 

■Rhea 

" GrTurtle 

"EPtTurtle Q. 

"MoleSkink o 
i_ 

DJQ 



I 



o 



"Iguana 
"Trout 
"Dogfish 
"Salamander 
"Caecilian 



jure 5: MRP supertree of all high-confidence quartets found by the SQi method applied to 



2nd and 3rd codon positions of the Phillips and Penny (2003) data set 
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Figure 6: MRP supertree of all high-confidence quartets found by the SQi method applied to 
1st, 2nd and 3rd codon positions of the RAG gene and the 1st, 2nd and 3rd codon positions of 
the non-RAG genes from the Gruber et al. (2007) data set. 
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